Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add assert_count_true() to verify that an expected number of values are TRUE #573

Open
wants to merge 9 commits into
base: main
Choose a base branch
from

Conversation

billdenney
Copy link
Collaborator

Related to #539

I use a variant of this often in my data pipelines where I want to select a certain number of values and be sure that I'm only getting the number I expect. For example, I want to change one date for one person, and I want to be sure that I only change that one date:

# We are very young
my_data <- data.frame(name = c("Bill", "Sam"), birthdate = c("2024-05-22", "2024-05-22"))
my_data |>
  dplyr::mutate(
    birthdate =
      dplyr::case_when(
        assert_count_true(name == "Bill" & birthdate == "2024-05-22") ~ "2024-05-23",
        TRUE ~ birthdate
      )
  )

@billdenney billdenney requested a review from sfirke May 22, 2024 14:20
@sfirke
Copy link
Owner

sfirke commented Nov 25, 2024

@billdenney thanks for this and sorry to take so long to review!

Three things come to mind after reviewing:

1. Function name
Would it be cleaner to call this just assert_count ? I think the true is clear enough from the usage, but I don't feel strongly if you prefer the current way.

2. Providing a before-and-after example

I get the value of this function and would use it. But it took me reading each code block to understand its value. I think we can make it clearer upfront in the documentation (package vignette and function docs) that this doesn't change anything in the data; it's a wrapper that performs a data quality check mid-pipeline and halts things if an assertion fails.

The best way I can think of is to show a scenario with before and after example. For instance:


Imagine we have a column in a dataset that contains student test scores. A few scores were entered into the grade platform after the data file was exported, so we're filling them in manually during data cleaning. Ordinarily we might do that like this:

clean <- 
  raw |>
  dplyr::mutate(
    test_score =
      case_when(
        student_id == 12345 & is.na(test_score) ~ 84,
        student_id == 67890 & is.na(test_score) ~ 91,
        TRUE ~ test_score
      )
  )

But this has a risk: if in a later version of the data, one of these students is missing two test scores, the value will be imputed for both cases. assert_count_true defends against that. It wraps the statements, doing nothing if the assumed count is matched and throwing an error if not.

clean <- 
  raw |>
  dplyr::mutate(
    test_score =
      case_when(
        assert_count_true(student_id == 12345 & is.na(test_score), count = 1) ~ 84,
        assert_count_true(student_id == 67890 & is.na(test_score), count = 1) ~ 91,
        TRUE ~ test_score
      )
  )

3. janitor's approach in wading into the assertion space
I may be overthinking a little here but this feels like the start of a separate chapter, even if it's only one case for now. One implication is that I think we could carve out a little separate space in terms of function organization and documentation. Like this could go in a file assertions.R.

I am also considering how janitor complements the existing assertion functions out there. I looked at the assertr vignette and this blog post on 4 ways to be assertive in R.

To me the big difference is that assert_count_true does not take its own line in the pipeline, it wraps an existing line. I didn't see that in the other packages. You could make these same checks with assertr::assert() but at a cost of needing to write two lines for each replacement case: one to assert-check, one to replace. With your function, it's one line for each replacement, so there's less code to read and maintain.

I would argue this niche is unfilled as far as I know and that's this function's reason for being. Do you agree? If so maybe we briefly note that in the function or package docs so people understand when/why to use it.

@sfirke sfirke added this to the v2.3 milestone Nov 25, 2024
@billdenney
Copy link
Collaborator Author

@sfirke:

For point 2, yes, I like the example. I will incorporate it in some way (maybe a new assertion vignette).

For points 1 and 3, you understand all the ideas that I'm looking toward. I was thinking of the checkmate library which uses assert_* to ensure that aspects of input arguments are true, and Danielle Navarro's blog post (aside: she has an amazing blog!) shows other ways that this makes sense. I thought that it could be a good fit for janitor specifically because of my typical use case which you succinctly described in point 2. I use it to clean dirty data that may change over time or that I want to be sure I only capture the expected number of rows for, even if it doesn't change over time.

I agree that this would be a new chapter for janitor, and it deserves a new organizational layer. I'll work those in over the next few days. I also agree that the _true part of the function name can be assumed (and therefore removed from the function name).

@sfirke
Copy link
Owner

sfirke commented Dec 16, 2024

Hi @billdenney - friendly ping here. I am gonna try to submit to CRAN by the end of the year, I got another email from them this month saying it was my final notice to fix the package and resubmit. Think you'll be able to implement the above feedback? If not we could try for a fast-follow release and get a few more outstanding things in Spring 2025.

@billdenney
Copy link
Collaborator Author

@sfirke, I think that I covered all the to-do items here. Please take a look.

Also, I found a bug in either the commonmark or xml2 package while trying to add spell checking to the package. So, spell checking is only partially added (that's unrelated to the main purpose of the PR, but it seemed helpful sorry for making this a bigger change to review).

Copy link

codecov bot commented Dec 18, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 100.00%. Comparing base (81702b6) to head (1cb9c4b).

Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff            @@
##              main      #573   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files           27        28    +1     
  Lines         1413      1428   +15     
=========================================
+ Hits          1413      1428   +15     
Files with missing lines Coverage Δ
R/assertions.R 100.00% <100.00%> (ø)
R/clean_names.R 100.00% <100.00%> (ø)
R/round_half_up.R 100.00% <ø> (ø)

@billdenney
Copy link
Collaborator Author

I merged the current main branch back in here. This should be ready to review again. When it's merged, we can head over to the bigger package separation PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants